Metabolism is a major regulator of immune cell function, but it remains difficult to study the metabolic status of individual cells. This motivated the development of Compass, an algorithm to characterize cellular metabolic states based on single-cell RNA-Seq and flux balance analysis (FBA).
A metabolic flux is the instantaneous rate in which a chemical reaction occurs (i.e., the amount of a metabolite processed by one or more catalytic steps per unit time). Compass belongs to the family of static FBA algorithms, which assumes metabolic steady-state (the state of a metabolic pathway can be expressed as a vector, which denotes the yield of the compounds or the flux in that pathway at a given time. The steady state is a state that remains unchanged over time).
Compass leverages prior knowledge about the metabolic network of the cells in question in the form of a Genome-Scale Metabolic Model (GSMM), which includes the following components:
A stoichiometric matrix \(S\) which describes the set of possible metabolic reactions in the system. Its rows correspond to metabolites, columns to reactions, and entries hold the stoichiometric coefficients of the reactions available to the cells. Reactions for uptake and secretion of a metabolite are encoded as having only a coefficient of 1 and −1 in the metabolite’s row entry, respectively, and 0 otherwise.
Reactions are partitioned into cellular compartments representing membrane-enclosed space in which a biochemical reaction may take place, for example the cytosol, Golgi, or mitochondria.
Metabolites are represented separately for each compartment. For example, cytosolic citrate and mitochondrial citrate correspond to two rows in \(S\).
Reactions may contain non-zero coefficients for metabolites located in different compartment to represent physiological transportation of metabolites across membranes. These reactions are called Transport reactions.
The extracellular space is represented as an additional compartment.
Reactions may have exactly one non-zero coefficient if the corresponding metabolite is located in the extracellular space. These reactions are called Exchange reactions and used to import metabolites into or export them from the system. It is standard convention to assign −1 as the non-zero coefficient of exchange reactions.
Upper and lower bounds on fluxes through the reactions corresponding to columns of \(S\).
A set of genes coding enzymes that catalyze reactions in the network
Gene-to-reaction associations:
Every reaction is associated with a boolean expression over gene literals.
Every gene is assigned a truth value based on its presence or absence in the cell’s genome. A gene may be absent, for example, in a knockout genotype.
The boolean expressions code the dependency of reactions on their catalyzing proteins. Usually, an OR relationship corresponds to isozymes, namely two enzymes capable of catalyzing the same reaction, and AND relationship corresponds to enzyme complexes.
Some notation:
the set of cells \(C\) has a total of \(n\) cells (or RNA libraries) and \(g\) genes;
the set of metabolic reactions in the GSMM, \(R\), has a total of \(m\) metabolic reactions;
\(rev(r)\) is the reverse unidirectional reaction of reaction \(r\), which has the same stoichiometry of \(r\) but proceeds in the opposite direction.
The Compass algorithm can be summarized in few steps:
Trascriptome-agnostic preparatory step.
In this step, for a given GSMM (e.g., Recon2), a preparatory step that does not depend on transcriptome data is run and its results are cached for the next steps. The maximal flux \(v_r^{opt}\) , where \(v\in \mathbb{R}^m\), that every reaction \(r\) can carry is computed under some constraints:
the system is in a steady-state \(S \cdot v = 0\);
directionality and capacity limits for reactions, including uptake and secretion limits are used \(\alpha\le v \le\beta\);
when evaluating the maximum flux for each reaction, its reverse reaction carries no flux \(v_{rev(r)}=0\) (to avoid the creation of a futile cycle).
From gene expression to reaction expression.
A reaction expression matrix is similar to the gene expression matrix \(G_{g \times n}\) but rows represent single metabolic reactions \(R_{m\times n}\). This matrix is created by using the boolean gene-to-reaction mapping included in the GSMM.
If a single gene with linear-scale expression \(x\) is associated with \(r\), the reaction’s expression will be \(R_{r,j}=\log_{2}(x+1)\).
If no genes are associated with \(r\) then \(R_{r,j}=0\).
If the reaction is associated with more than one gene, then the reaction’s expression will be defined as the sum (ORs) or mean of linear-scale (ANDs).
Information sharing between single cells (smoothing).
To mitigate the sparseness and stochasticity of single-cell measurements, Compass allows for a degree of information-sharing between cells with similar transcriptional profiles. Given a gene expression \(G\), we compute k-nearest neighbors (kNN) graph based Euclidean distances in reduced dimension, obtained by taking the top 20 principal components of \(G\). The PCA is computed over all the genes in \(G\), not only metabolic ones:
let \(R=r_{i,j}\) and \(w_{i,j} = \frac{1}{k}\) if cell \(j\) is in the k-nearest-neighborhood of cell \(i\);
then \(R^N=r_{i,j}^N\) where \(r_{i,j}^N=\sum_{c\in C} w_{j,c}r_{i,c}\)
Main algorithm.
Compass transforms a gene expression matrix \(G_{g\times n}\) into a matrix \(C_{m\times n}\) of scores, where rows represent metabolic reactions and the columns are the same RNA libraries as in the gene expression matrix. Each entry represents a proxy for potential reaction activity. More precisely, the entry quantifies the propensity of the cell to use that reaction. Here the steps:
Parameters:
smoothing parameter \(\lambda \in [0,1]\) (\(\lambda = 0.25\));
nearest neighbor parameter \(k\) (\(k=10\));
optimality slack parameter \(w\in[0,1]\) (\(w=0.95\));
penalty function \(p(x)\) (\(p(x)=\frac{1}{1+x}\))
meta-reaction merging threshold \(\rho\in[0,1]\).
Conversion of \(G_{g\times n}\) gene expression matrix into \(R_{m\times n}\) reaction expression matrix and a neighborhood reaction expression matrix \(R_{m\times n}^N\).
Conversion of \(R_{m\times n}\) reaction expression matrix into \(P_{m\times n}\) penalty matrix by point-wise inversion \(P_{m\times n} = p(R_{m\times n})\). While \(R\) represents gene expression support that a reaction is functional in cell, \(P\) represents the lack if it. The computation of \(R\) and \(P\) occurs also for the neighborhood of each cell for smoothing \(P_{m\times n}^N = p(R_{m\times n}^N)\). Finally, \(\hat P=(1-\lambda)P+\lambda P^N\).
For each reaction \(r\in R\) and \(c \in C\) the following steps are computed:
let \(\hat P^{(c)}=(\hat P_{1,c},...,\hat P_{m,c})\)
\(\DeclareMathOperator*{\minimize}{minimize} y_{r,c}:=\mathop{\mathrm{minimize}}_{v \in R^m} \hat P^{(c)}\cdot v\) subject to:
\(S\cdot v=0\);
\(\alpha \le v \le \beta\);
\(v_{rev(r)}=0\)
\(v_r\ge w\cdot v_r^{opt}\)
So, an high penalty \(y_{r,c}\) indicates that cell \(c\) is unlikely, judged by transcriptomic evidence, to use reaction \(r\). For all the reactions and cells, \(C^{raw} = [y_{r,c}]_{m\times n}\);
Rows in the \(C^{raw}\) matrix that correspond to reactions that are topologically close in the metabolic network can be highly correlated. Hierarchical clustering on Spearman distance of rows, can be used to obtain clusters of meta-reactions which are formed by closely correlated metabolic reactions. This step is data-driven and does not rely on canonical metabolic pathway definitions.
After computing hierarchical clusters over rows of \(C^{raw}\), leaves where Spearman similarity (\(1-\rho\)) is higher than a fixed threshold are averaged together (e.g., \(\rho=0.98\)).
Meta-reaction scores are computed\(C^{meta-raw} = [y_{r,c}^{\prime}]_{m^{\prime \prime}\times n}\) where \(m^{\prime \prime} \le m\):
\(C^{meta-raw} :=-\log(1+C^{meta-raw})\)
\(C = C^{meta-raw}-min(C^{meta-raw})\)
Constant rows from \(C\) are removed. A row is defined as constant if the difference between the largest and smallest score is less than \(\epsilon=10^{-3}\).
Experimental Autoimmune Encephalomyelitis (EAE) is the mouse model of multiple sclerosis. The original experiment was set to compare the leukocytes that are in the spinal cord in the diseased toward its control in the two stages of disease that are onset, when the first visible clinical signs appear, and chronic that is when the disease starts to have chronic clinical progression. Controls are not healthy mice, as they were injected with a mixture that recreates inflammation that is not EAE specific.
The assumption is that these diseases are influenced, if not induced, by precisely this migration of leukocytes from the blood into central nervous tissue. The results of the single-cell RNA-sequencing (scRNA-seq) were analysed exploiting a workflow of singleCellExperiment. And during the process also the cell types were determined.
In this example, we use the single-cell data from the above-explained experiment and we focus on the CD4 cells considering both onset and chronic stages of the EAE mice (no controls).
The input gene expression matrix can be a tab-delimited text file (tsv) containing normalized and batch-effects-cleaned gene expression estimates (CPM, TPM, or similar scaled units) with one row per gene, one column per sample. Tab-delimited files need row and column labels corresponding to genes and sample names.
To run COMPASS the following command has been used:
Compass [-h] [--data FILE] [--data-mtx FILE [FILE ...]]
[--model MODEL] [--species SPECIES] [--media MEDIA]
[--output-dir DIR] [--temp-dir DIR] [--torque-queue QUEUE]
[--num-processes N] [--lambda F] [--num-threads N] [--and-function FXN]
[--select-reactions FILE] [--select-subsystems FILE] [--num-neighbors N]
[--symmetric-kernel] [--input-weights FILE] [--penalty-diffusion MODE]
[--no-reactions] [--calc-metabolites] [--precache] [--input-knn FILE]
[--output-knn FILE] [--latent-space FILE] [--only-penalties]
[--example-inputs] [--microcluster-size C] [--list-genes FILE]
[--list-reactions FILE]
Several runs have been performed:
compass --data linear_gene_expression_matrix.tsv
--num-processes 20
--species mus_musculus
--output-dir no_lamda > out 2> log
compass --data linear_gene_expression_matrix.tsv
--num-processes 20
--species mus_musculus
--lambda 0.25
--output-dir lamda > out 2> log
compass --data linear_gene_expression_matrix.tsv
--num-processes 20
--species mus_musculus
--lambda 0.25
--num-neighbors 10
--output-dir lamda_met_k10
--calc-metabolites
The last one is used for the following analyses.
As specified earlier, one of the intermediate outputs of COMPASS is a matrix of penalty scores about the reactions expressions. All the steps which bring the matrix of penalty scores, \(P_{m\times n}\), to the matrix of reaction consistencies, \(C_{m\times n}\), are taken into account in the post-processing steps.
Two similar, almost equivalent, packages are available to manipulate COMPASS outputs and perform this and other statistical analyses. The first in python and the second in R. While the python version is still maintained, the R version is going to be deprecated soon. However, we modified the R package called compassR to make it equivalent to his python counterpart.
The modified compassR version is in this repository. To install it the following command can be run:
devtools::install_github("mcalgaro93/compassR")
Then, we load some packages for data manipulation and visualization.
To perform COMPASS postprocessing some settings must be chosen. By setting the class object compassSettings, it is possible to define:
the metabolic model directory;
the genes, metabolites, reactions, and cells metadata;
the user data directory;
the COMPASS’s output with the reaction penalties;
the gene expression input matrix;
column names for cells and genes;
some naming option for the direction of reactions;
the minimum range of variation for a reaction to be kept (not variable reactions are removed);
the cut height for the hierarchical clustering used to define meta-reactions (
user_data_directory <- "../../data/eae/tpm_counts_normalized"
# user_data_directory <- "../../data/eae/tpm_counts"
settings <- CompassSettings$new(
metabolic_model_directory = system.file("extdata",
"RECON2", package = "compassR"),
gene_metadata_file = "gene_metadata.csv",
metabolite_metadata_file = "metabolite_metadata.csv",
reaction_metadata_file = "reaction_metadata.csv",
user_data_directory = user_data_directory,
cell_metadata_file = "cell_metadata.csv",
compass_reaction_scores_file = "reactions.tsv",
linear_gene_expression_matrix_file = "linear_gene_expression_matrix.tsv",
cell_id_col_name = "cell_id",
gene_id_col_name = "MGI.symbol",
reaction_direction_separator = "_",
reaction_directions = c("pos", "neg"),
min_reaction_consistency = 1e-04, # Not used
min_reaction_range = 1e-03,
cluster_strength = 0.02 # Spearman correlation 0.98
)
To run the COMPASS postprocessing the CompassData object must be created. This procedure is quite time expensive. For this reason we create this object once and then we can load it.
if(!file.exists("../../data/eae/eae_cd4_compass_data_mgi.rds")){
compass_data <- CompassData$new(settings = settings)
saveRDS(compass_data,
file = "../../data/eae/eae_cd4_compass_data_mgi.rds")
} else compass_data <- readRDS("../../data/eae/eae_cd4_compass_data_mgi.rds")
The compassData object contains several informations:
the reaction and meta-reaction consistencies;
metabolic genes;
genes, cells, reactions, and metabolites metadata
compass_data
## CompassData:
## reaction_consistencies data frame (6386 reactions x 224 cells)
## metareaction_consistencies data frame (1475 metareactions x 224 cells)
## metabolic_genes tibble (11930 genes x 2 fields)
## gene_expression_statistics tibble (224 cells x 4 fields)
## cell_metadata tibble (224 cells x 12 fields)
## gene_metadata tibble (1733 genes x 5 fields)
## metabolite_metadata tibble (5063 metabolites x 7 fields)
## reaction_metadata tibble (7440 reactions x 7 fields)
## reaction_partitions tibble (6386 reactions x 4 fields)
To proceed with the analysis, a CompassAnalyzer object is created. Two functions are natively available:
conduct_wilcoxon_test() to conduct a wilcoxon test to compare reaction differential consistency between group of cells;
get_umap_components() to ordinate the cells based on metabolic profiles.
compass_analyzer <- CompassAnalyzer$new(settings)
The Uniform Manifold Approximation and Projection (UMAP) could be used as dimensional reduction method. UMAP is a non-linear dimensional reduction method which is very effective for visualizing clusters or groups of data points and their relative proximities.
# As cell_id are numbers, they have been preceeded by the 'X'
# character. For this reason, we remove the character.
compass_data$gene_expression_statistics$cell_id <-
compass_data$gene_expression_statistics$cell_id %>%
gsub(pattern = "X", replacement = "")
# We set a seed for UMAP reproducibility
set.seed(123)
# We perform the UMAP and we add metadata informations
cell_info_with_umap_components <-
compass_analyzer$get_umap_components(
compass_data$reaction_consistencies
) %>%
inner_join(
compass_data$cell_metadata,
by = "cell_id"
) %>%
left_join(
compass_data$gene_expression_statistics,
by = "cell_id"
)
# UMAP components are computed as characters.
# We convert them in numeric values
cell_info_with_umap_components$component_1 <-
as.numeric(cell_info_with_umap_components$component_1)
cell_info_with_umap_components$component_2 <-
as.numeric(cell_info_with_umap_components$component_2)
In Figure 3.1 promiscous groups of cells are clearly visible. Each group contains cells of both stages. At first sight we can say that the first two UMAP dimensions don’t separate cells based on their stage.
p_m_stage <- ggplot(cell_info_with_umap_components, aes(x = component_1,
y = component_2, color = stage)) +
geom_point(alpha = 0.8) +
theme(legend.position = "bottom") +
ggtitle("UMAP colored by stage (Chronic or Onset)") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
theme(legend.position = "bottom")
p_m_stage
Figure 3.1: Each dot represents a single cell coloured based on their stage (Chronic or Onset).
In figure 3.2 we can’t see group of cells containing only one single cell type. Thus, the first two UMAP dimensions seem not to separate cells based on their stage.
p_m_ct <- ggplot(cell_info_with_umap_components,
aes(x = component_1, y = component_2, color = pruned_fine)) +
geom_point(alpha = 0.8) +
ggtitle("UMAP colored by cell type") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_manual(values = RColorBrewer::brewer.pal(12, "Set3")) +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = "Cell type", ncol = 2))
p_m_ct
Figure 3.2: Each dot represents a single cell coloured based on their cell type.
On the summary of our compassData object we have gene_expression table containing the columns total_expression, metabolic_expression, and metabolic_activity. A cell’s “total expression” is the extent to which it expresses any of its genes. Its “metabolic expression” is the extent to which it expresses its metabolic genes. And finally, its “metabolic activity” is the ratio of its metabolic expression to its total expression. In Figure 3.3, cells within the same group present different metabolic activities. Therefore they don’t seem to be segregated based on their metabolic activity levels. Also here we see a very narrow range of metabolic activity values.
# umap according to metabolic activity
p_m_ma <- ggplot(cell_info_with_umap_components,
aes(x = component_1, y = component_2, color = metabolic_activity)) +
geom_point(alpha = 0.8) +
theme(legend.position = "bottom") +
ggtitle("UMAP colored by metabolic activity") +
xlab(label ="UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_viridis_c() +
theme(legend.position = "bottom")
p_m_ma
Figure 3.3: Each dot represents a single cell coloured based on their metabolic activity.
None of the above features alone differentiate our cells. For this reason a possible solution to disclose the cells’ characteristics is to group them in clusters based on the first two dimensions of the UMAP plot. Then, we try to discover what are the characteristics of each cluster.
Firstly, to automatically detect the number of clusters, we will use the function get_optimal_clusters() to identify the number of clusters which maximizes the silhouette index.
opt_cl <- get_optimal_clusters(UMAPcoords = cell_info_with_umap_components[, c("component_1", "component_2")], k.values = 2:15, plot = TRUE)
Figure 4.1: Average silhouette indexes for several number of clusters. The hierarchical clustering algorithm on ‘euclidean’ distances using the ‘average’ agglomeration method has been used.
According to the average silhouette values in Figure 4.1, the optimal number of clusters is 6.
To obtain the cluster membership for each cell, the function get_cluster() can be used. The number of cells in each cluster, stratified by cell type and stage is summarized in Figure 4.2.
cell_info_with_umap_components$cluster <- as.factor(get_cluster(
df = cell_info_with_umap_components[, c("component_1", "component_2")],
k = opt_cl))
# kableExtra::kable(
# x = table(cell_info_with_umap_components$cluster),
# caption = "Number of cells for each cluster.",
# col.names = c("Cluster", "N Cells"), booktabs = TRUE)
# Count how many cells for each cluster, stage, cell type
df_cluster <- plyr::ddply(.data = cell_info_with_umap_components,
.variables = ~ cluster + stage + pruned_fine,
function(cells) return(data.frame("n" = nrow(cells))))
# Count how many cells for each cluster
df_cluster_summary <- df_cluster %>%
group_by(cluster) %>%
dplyr::summarize(n = sum(n)) %>%
arrange(desc(n))
# Order the cluster levels for the summary
df_cluster_summary$cluster <- factor(df_cluster_summary$cluster,
levels = df_cluster_summary$cluster,
labels = df_cluster_summary$cluster, ordered = TRUE)
# Order the cluster levels for the df
df_cluster$cluster <- factor(df_cluster$cluster,
levels = df_cluster_summary$cluster,
labels = df_cluster_summary$cluster, ordered = TRUE)
# Generate table with the frequencies
main_plot <- ggplot(data = df_cluster,
aes(x = cluster, y = pruned_fine)) +
# coord_equal() +
geom_tile(aes(fill = pruned_fine, alpha = n), width = 0.8, height = 0.8) +
geom_text(aes(label = n)) +
facet_grid(stage ~ ., scales = "free", space = "free") +
scale_x_discrete(breaks = 1:6) +
ylab("Cell type") +
scale_fill_manual(guide = "none", values = RColorBrewer::brewer.pal(n = 12, "Set3")) +
scale_alpha_continuous(guide = "none")
# Generate a barplot with the total cells in each clusters
over_strip <- ggplot(data = df_cluster_summary,
aes(x = cluster, y = n, fill = cluster)) +
geom_col(aes(alpha = n), width = 0.8) +
geom_text(aes(label = n, y = n + 5)) +
scale_x_discrete(breaks = df_cluster_summary$cluster) +
scale_alpha_continuous(guide = "none") +
theme_void() +
scale_fill_manual(values = RColorBrewer::brewer.pal(6, "Set2"), breaks = c(1,2,3,4,5,6)) +
theme(legend.position = "none")
# Combine the two plots
cowplot::plot_grid(over_strip, main_plot, ncol = 1, axis = "lr", align = "v", rel_heights = c(1,5))
Figure 4.2: Cell types distribution across stages and clusters.
The UMAP representation is visible in Figure 4.3.
p_m_cluster <- ggplot(cell_info_with_umap_components,
aes(x = component_1, y = component_2, color = cluster)) +
geom_point(alpha = 0.8) +
ggtitle("UMAP colored by cluster") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_manual(values = RColorBrewer::brewer.pal(6, "Set2")) +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = "Cluster", nrow = 1))
p_m_cluster
Figure 4.3: Each dot represents a single cell coloured based on their cluster membership.
Before moving on with deeper analysis of these clusters, it would be interesting to investigate the UMAP representation based on the gene expression values with the information about the cluster membership of the cells based on their metabolic profile. In case of systematic patterns, we shall assume that there might be a direct association between gene expression levels and their reactions.
set.seed(123)
# load gene expression matrix
gene_exp_levels <- read.csv(file = settings$linear_gene_expression_matrix_path,
header = TRUE, sep = "\t", row.names = 1)
# calculate umap
gene_exp_umap_comps <- data.frame(uwot::umap(t(gene_exp_levels)))
colnames(gene_exp_umap_comps) <- c('component_1RNA', 'component_2RNA')
gene_exp_umap_comps$cell_id <- gsub(pattern = "X", replacement = "",
x = rownames(gene_exp_umap_comps))
# We perform the UMAP and we add metadata informations
cell_info_with_umap_components <-
cell_info_with_umap_components %>%
left_join(
gene_exp_umap_comps,
by = "cell_id")
In Figure 4.4), the cells are UMAP-ordinated by expression levels and colored by the membership we previously calculated using the reaction consistencies (panel a). By eye, we only obtain 2 distinguishable clusters and no patterns seem to be visible regarding the cluster membership based on the reaction consistencies. In the other panels, the stage (b), cell type (c), and metabolic activity (d) are represented.
p_a <- ggplot(cell_info_with_umap_components,
aes(x = component_1RNA, y = component_2RNA, color = as.factor(cluster))) +
geom_point(alpha = 0.8) +
ggtitle("UMAP (based on gene expression) colored by cluster membership") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_manual(values = RColorBrewer::brewer.pal(6, "Set2")) +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = "Cluster"))
p_b <- ggplot(cell_info_with_umap_components,
aes(x = component_1RNA, y = component_2RNA, color = stage)) +
geom_point(alpha = 0.8) +
ggtitle("UMAP (based on gene expression) colored by stage") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_manual(values = RColorBrewer::brewer.pal(6, "Set2")) +
theme(legend.position = "bottom")
p_c <- ggplot(cell_info_with_umap_components,
aes(x = component_1RNA, y = component_2RNA, color = pruned_fine)) +
geom_point(alpha = 0.8) +
ggtitle("UMAP (based on gene expression) colored by cell type") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_manual(values = RColorBrewer::brewer.pal(12, "Set3")) +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 6, byrow = TRUE))
p_d <- ggplot(cell_info_with_umap_components,
aes(x = component_1RNA, y = component_2RNA, color = metabolic_activity)) +
geom_point(alpha = 0.8) +
ggtitle("UMAP (based on gene expression) colored by metabolic activity") +
xlab(label = "UMAP 1") +
ylab(label = "UMAP 2") +
scale_color_viridis_c() +
theme(legend.position = "bottom")
cowplot::plot_grid(p_a, p_b, p_c, p_d, nrow = 2, align = "hv", axis = "tblr", labels = "auto")
Figure 4.4: Each dot represents a single cell coloured by cluster membership computed from the reaction consistencies (a), stage (b), cell type (c), and metabolic activity (d). The UMAPs are based on the gene expression values.
We can clearly see a separation in the UMAP based on gene expressions for the stage (panel b) variable and a slightly visible gradient for metabolic activity (panel d) with darker dots in the bottom-left corner of the UMAP.
cowplot::plot_grid(p_m_cluster, p_m_stage, p_m_ct, p_m_ma, nrow = 2, align = "hv", axis = "tblr", labels = "auto")
Figure 4.5: Each dot represents a single cell coloured by cluster membership computed from the reaction consistencies (a), stage (b), cell type (c), and metabolic activity (d). The UMAPs are based on the gene expression values.
Firstly, we compare the metabolic reactions which are cluster-specific. To do so, we perform a Wilcoxon test to detect differentially consistent reactions of cells in a specific cluster against all the other cells.
Firstly, we extract the cell_ids for the groupA, corresponding to cells in cluster 1, and the groupB, corresponding to all the other cells.
group_A_cell_ids <-
cell_info_with_umap_components %>%
filter(cluster == 1) %>%
pull(cell_id)
group_B_cell_ids <-
cell_info_with_umap_components %>%
filter(cluster != 1) %>%
pull(cell_id)
We then perform the wilcoxon test.
wilcoxon_results <- compass_analyzer$conduct_wilcoxon_test(
compass_data$reaction_consistencies,
group_A_cell_ids,
group_B_cell_ids,
for_metareactions = FALSE
)
And we attach to the list of results some additional information regarding the reactions, such as the subsystem they belong, their name, the reaction formula, the associated genes, and others.
wilcoxon_results_with_metadata <- wilcoxon_results %>%
mutate(
reaction_no_direction = gsub(
pattern = "_pos|_neg",
replacement = "",
x = reaction_id),
group = ifelse(adjusted_p_value <= 0.05,
yes = ifelse(cohens_d >= 1,
yes = "Significant and Effect Size",
no = "Significant"),
no = ifelse(cohens_d >= 1,
yes = "Effect Size",
no = "Not Significant"))
) %>%
inner_join(
compass_data$reaction_metadata,
by = "reaction_no_direction"
) %>%
mutate(
core = confidence %in% c(0,4) & !is.na(EC_number)
)
wilcoxon_results_with_metadata$core = factor(
wilcoxon_results_with_metadata$core,
levels = c(TRUE, FALSE),
labels = c("Core Metabolism", "Not Core Metabolism"))
All this information is available in the following Table where reported reactions:
are significant, i.e., adjusted p-value < 0.05;
have a known E.C. number (the Enzyme Commission number is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. As a system of enzyme nomenclature, every EC number is associated with a recommended name for the corresponding enzyme-catalyzed reaction).
have a confidence parameter equal to 0 (denoting reactions whose confidence was not evaluated) or 4 (denoting reactions whose confidence is supported by biochemical evidence).
This allow us to identify the differentially consistent reactions of the core metabolism (based on reaction metadata included in the Recon2 database). Since pathways generally considered part of primary metabolism are also the best studied ones, we define a reaction as belonging to core metabolism if (a) its Recon2 confidence is either 0 or 4; and (b) it is annotated with an EC (Enzyme Commission) number. We chose to label reactions with unevaluated confidence (i.e., Recon2 confidence score of 0) as part of core metabolism because some of them were found to be key reactions in primary metabolic pathways based on manual correction. Our definition of core metabolism is equivalent to taking the set of all metabolic reactions in Recon2, but excluding reactions that either don’t have an annotated EC number or for which the Recon2 curators explicitly specified they do not have direct biochemical support.
DC_core_metabolism <- wilcoxon_results_with_metadata %>%
filter(adjusted_p_value < 0.05 & core == "Core Metabolism") %>%
select(reaction_id, reaction_name, subsystem,
associated_genes, cohens_d, adjusted_p_value)
DT::datatable(DC_core_metabolism,
caption = "Cluster 1 vs All - Differentially consistent core metabolism reactions", filter = "top", extensions = 'Buttons',
options = list(dom = 'Blfrtip', buttons = c('copy', 'csv', 'excel'),
lengthMenu = list(c(10,25,50,-1), c(10,25,50,"All"))))
The effect size is summarized by the Cohen’s D statistic, which is computed as the difference between the mean of the two groups, divided by the pooled standard deviation. Positive values represent reactions which are more consistent in Cluster 1, while negative reaction represent reactions which are more consistent in all the other clusters. To summarize the results, several plots are proposed. In the first place, Figure 4.6 shows the relationship between effect sizes and adjusted p-values through a volcano plot.
plot_ly(
data = wilcoxon_results_with_metadata,
x = ~ cohens_d,
y = ~ -log10(adjusted_p_value),
text = ~ reaction_name,
mode = "markers",
color = ~ group,
symbol = ~ core,
type = "scatter") %>%
layout(title ="Volcano Plot - Cluster 1 vs All")
Figure 4.6: Volcano plot. Each dot represents a reaction. Not significant reactions are closer to the origin of axes. The higher, the lower the adjusted p-value. The righter or lefter, the stronger the effect size.
To furtherly summarize the results about reactions’ subsystems, Figure 4.7 shows the number of significant reactions for each subsystem, dividing them by core and not core reactions.
sig_subsystem <- wilcoxon_results_with_metadata %>%
filter(adjusted_p_value <= 0.05)
sig_subsystem_summary <- sig_subsystem %>%
group_by(subsystem, core) %>%
dplyr::summarise(
pos = sum(cohens_d > 0),
neg = -sum(cohens_d < 0))
df_to_plot <- reshape2::melt(data = sig_subsystem_summary)
ggplot(data = df_to_plot, mapping = aes(x = value, y = reorder_within(
x = subsystem,
by = value,
within = core), fill = variable)) +
facet_wrap(facets = ~ core, nrow = 1, scales = "free") +
geom_col(orientation = "y") +
geom_text(aes(label = ifelse(value > 0, value, "")), hjust = "right") +
geom_text(aes(label = ifelse(value < 0, abs(value), "")), hjust = "left") +
scale_x_continuous(trans = ggallin::pseudolog10_trans) +
scale_y_reordered() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0)) +
theme(legend.position = "top") +
ylab("Subsystem") + xlab("Number of reactions") +
ggtitle(
label = "Number of differentially consistent reactions",
subtitle = "Colored by direction and stratified by core metabolism") +
scale_fill_discrete(name = "More consistent in:", labels = c("Cluster 1", "All the other clusters"))
Figure 4.7: Number of significant reactions by subsystem, stratified by core and not core reactions.
Similarly, Figure 4.8 shows the Cohen’s D values for the significant reactions for each subsystem, by core and not core reactions, ordered by median Cohen’s D value.
ggplot(data = sig_subsystem,
mapping = aes(
x = cohens_d,
y = reorder_within(
x = subsystem,
by = cohens_d,
within = core,
fun = median),
color = as.factor(sign(cohens_d)))) +
facet_wrap(~ core, scales = "free") +
scale_y_reordered() +
geom_jitter(width = 0, height = 0.3) +
geom_vline(aes(xintercept = 0), lty = 2, color = "grey") +
theme(legend.position = "top") +
ylab("Subsystem") + xlab("Cohen's D Effect Size") +
ggtitle(
label = "Effect size of the differentially consistent reactions",
subtitle = "Colored by direction and stratified by core metabolism, ordered by median Cohen's D") +
scale_color_discrete(name = "More consistent in:", breaks = as.factor(c(1,-1)), labels = c("Cluster 1", "All the other clusters"), type = rev(scales::hue_pal()(2)))
Figure 4.8: Cohen’s D values for significant reactions by subsystem, stratified by core and not core reactions, ordered by median value.
The same procedure performed to detect differentially consistent reactions can be performed to detect differentially consistent metareactions. Instead of repeating the whole process, the compare_cluster() function with for_metareactions = TRUE can be used. Given the labels of the two groups of cells to be compared, it returns:
the matrix of wilcoxon results with reactions metadata;
the differentially consistent reactions or metareactions belonging to the core metabolism;
the same table in a ready-to-plot format;
two plot visualization of the differentially consistent reactions.
meta_1_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 1,
cluster_B = c(2,3,4,5,6),
for_metareactions = TRUE,
adjusted_p_value_th = 0.05,
cohens_d_th = 1)
This time, the number of reactions and the number of their subsystems are reported for each metareaction, together with the Cohen’s D effect size and adjusted p-value.
meta_1_all$DC_core_metabolism_table
To inspect the reactions of few meta-reaction groups, the Figure 4.9 can be created:
plot_metareaction(
met_id = c("group_5","group_111", "group_155"),
wilcoxon_results_with_metadata = meta_1_all$wilcoxon_results_with_metadata
)
Figure 4.9: Alluvium graphical representation for metareaction groups of reactions, their subsystem and metabolism type.
res_cl2_vs_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 2, cluster_B = c(1,3,4,5,6),
for_metareactions = FALSE)
res_cl2_vs_all$DC_core_metabolism_table
res_cl2_vs_all$p_DC_reactions
Figure 4.10: Number of significant reactions by subsystem, stratified by core and not core reactions.
res_cl3_vs_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 3, cluster_B = c(1,2,4,5,6),
for_metareactions = FALSE)
res_cl3_vs_all$DC_core_metabolism_table
res_cl3_vs_all$p_DC_reactions
Figure 4.11: Number of significant reactions by subsystem, stratified by core and not core reactions.
res_cl4_vs_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 4, cluster_B = c(1,2,3,5,6),
for_metareactions = FALSE)
res_cl4_vs_all$DC_core_metabolism_table
plot_ly(
data = res_cl4_vs_all$wilcoxon_results_with_metadata,
x = ~ cohens_d,
y = ~ -log10(adjusted_p_value),
text = ~ reaction_name,
mode = "markers",
color = ~ group,
symbol = ~ core,
type = "scatter") %>%
layout(title ="Volcano Plot - Cluster 4 vs All")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Figure 4.12: Volcano plot. Each dot represents a reaction. Not significant reactions are closer to the origin of axes. The higher, the lower the adjusted p-value. The righter or lefter, the stronger the effect size.
res_cl4_vs_all$p_DC_reactions
Figure 4.13: Number of significant reactions by subsystem, stratified by core and not core reactions.
res_cl4_vs_all$p_DC_reactions_cohensd
Figure 4.14: Cohen’s D values for significant reactions by subsystem, stratified by core and not core reactions, ordered by median value.
meta_4_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 4,
cluster_B = c(1,2,3,5,6),
for_metareactions = TRUE,
adjusted_p_value_th = 0.05,
cohens_d_th = 1)
This time, the number of reactions and the number of their subsystems are reported for each metareaction, together with the Cohen’s D effect size and adjusted p-value.
meta_4_all$DC_core_metabolism_table
plot_metareaction(
met_id = c("group_101","group_438", "group_259", "group_175"),
wilcoxon_results_with_metadata = meta_4_all$wilcoxon_results_with_metadata)
Figure 4.15: Alluvium graphical representation for metareaction groups of reactions, their subsystem and metabolism type.
res_cl5_vs_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 5, cluster_B = c(1,2,3,4,6),
for_metareactions = FALSE)
res_cl5_vs_all$DC_core_metabolism_table
res_cl5_vs_all$p_DC_reactions
Figure 4.16: Number of significant reactions by subsystem, stratified by core and not core reactions.
res_cl6_vs_all <- compare_clusters(
cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 6, cluster_B = c(1,2,3,4,5),
for_metareactions = FALSE)
res_cl6_vs_all$DC_core_metabolism_table
res_cl6_vs_all$p_DC_reactions
Figure 4.17: Number of significant reactions by subsystem, stratified by core and not core reactions.
cl2_cl1 <- compare_clusters(cell_info = cell_info_with_umap_components,
compass_analyzer = compass_analyzer,
compass_data = compass_data,
variable = "cluster",
cluster_A = 2, cluster_B = 1,
for_metareactions = FALSE)
cl2_cl1$DC_core_metabolism_table
cl2_cl1$p_DC_reactions
Figure 4.18: Number of significant reactions by subsystem, stratified by core and not core reactions.